[SPARK-26085][SQL] Key attribute of non-struct type under typed aggregation should be named as "key" too #23054

viirya · 2018-11-16T01:53:50Z

What changes were proposed in this pull request?

When doing typed aggregation on a Dataset, the key attribute is named "key" if the key type is a struct, but "value" if the key type is a non-struct. This PR names the key attribute "key" for non-struct types as well.
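A minimal sketch of the behaviour described above, assuming a local Spark session with `spark-sql` on the classpath. The object name and sample data are illustrative and not part of the patch. It prints the name of the first schema field for a non-struct key and for a struct (tuple) key. Per the description, the non-struct case is where the name changes from "value" to "key".

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: the object name and sample data are not part of the patch.
object GroupingKeyNameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("grouping-key-name-sketch")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(1, 2, 3).toDS()

    // Non-struct key (Int): the case this pull request changes.
    val byInt = ds.groupByKey(x => x).count()
    println(s"non-struct key column: ${byInt.schema.head.name}")

    // Struct key (a tuple is serialized as a struct): the case already named "key".
    val byTuple = ds.groupByKey(x => (x, x % 2)).count()
    println(s"struct key column: ${byTuple.schema.head.name}")

    spark.stop()
  }
}
```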

How was this patch tested?

Added test.

viirya · 2018-11-16T01:53:58Z

cc @cloud-fan

cloud-fan · 2018-11-16T05:20:06Z

makes sense to me. This is a behavior change right? Shall we write a migration guide?

SparkQA · 2018-11-16T05:25:04Z

Test build #98891 has finished for PR 23054 at commit c7bbe91.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-11-16T05:25:05Z

Ok. Let me update migration guide.

SparkQA · 2018-11-16T08:05:02Z

Test build #98902 has finished for PR 23054 at commit 42e32ad.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-11-16T08:18:12Z

retest this please

SparkQA · 2018-11-16T11:51:30Z

Test build #98907 has finished for PR 23054 at commit 42e32ad.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-11-17T13:07:27Z

Test build #98961 has finished for PR 23054 at commit 2b697dc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2018-11-17T17:37:13Z

We should add a “legacy” flag in case somebody’s workload gets broken by this. We can remove the legacy flag in a future release.

viirya · 2018-11-18T02:26:04Z

Ok. I will add a flag. Thanks @rxin

SparkQA · 2018-11-18T11:29:48Z

Test build #98977 has finished for PR 23054 at commit 6e3c37a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-11-18T11:37:35Z

retest this please.

SparkQA · 2018-11-18T13:15:32Z

Test build #98978 has finished for PR 23054 at commit 6e3c37a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-11-18T13:31:35Z

retest this please.

SparkQA · 2018-11-18T13:51:59Z

Test build #98976 has finished for PR 23054 at commit 4f85876.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-11-18T20:12:36Z

Test build #98981 has finished for PR 23054 at commit 6e3c37a.

This patch fails from timeout after a configured wait of 400m.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2018-11-18T20:19:40Z

BTW what do the non-primitive types look like? Do they get flattened, or is there a struct?

viirya · 2018-11-19T00:11:31Z

For struct types there is a struct named "key".

cloud-fan · 2018-11-19T01:35:53Z

docs/sql-migration-guide-upgrade.md


  - The `ADD JAR` command previously returned a result set with the single value 0. It now returns an empty result set.

+  - In Spark version 2.4 and earlier, `Dataset.groupByKey` results to a grouped dataset with key attribute wrongly named as "value", if the key is atomic type, e.g. int, string, etc. This is counterintuitive and makes the schema of aggregation queries weird. For example, the schema of `ds.groupByKey(...).count()` is `(value, count)`. Since Spark 3.0, we name the grouping attribute to "key". The old behaviour is preserved under a newly added configuration `spark.sql.legacy.atomicKeyAttributeGroupByKey` with a default value of `false`.


I realized that only struct-type keys have the "key" alias. So here we should say: if the key is non-struct type, e.g. int, string, array, etc.

Ok. More accurate.

cloud-fan · 2018-11-19T01:37:45Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

      .createWithDefault(false)
+
+  val LEGACY_ATOMIC_KEY_ATTRIBUTE_GROUP_BY_KEY =
+    buildConf("spark.sql.legacy.atomicKeyAttributeGroupByKey")


spark.sql.legacy.dataset.aliasNonStructGroupingKey?

cloud-fan · 2018-11-19T01:53:25Z

sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala

    val keyColumn = if (!kExprEnc.isSerializedAsStruct) {
      assert(groupingAttributes.length == 1)
-      groupingAttributes.head
+      if (SQLConf.get.aliasNonStructGroupingKey) {


we should do the alias when config is true...

hmm, don't we want to have the "key" attribute by default, and only have the old "value" attribute when we turn on the legacy config?
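The semantics proposed in this reply can be modelled as a small decision function. This is a stand-alone sketch, not Spark code: `KeyColumnNaming`, `keyColumnName`, and both parameter names are hypothetical and exist only for illustration.

```scala
// Hypothetical stand-alone model of the naming rule discussed in this thread.
// None of these names exist in Spark.
object KeyColumnNaming {
  /** Name of the grouping-key column in the aggregation result.
    *
    * @param keyIsStruct     whether the key is serialized as a struct
    * @param legacyValueName whether the legacy flag is turned on
    */
  def keyColumnName(keyIsStruct: Boolean, legacyValueName: Boolean): String =
    if (keyIsStruct) "key"            // struct keys were already named "key"; the flag does not apply
    else if (legacyValueName) "value" // legacy flag on: keep the old attribute name
    else "key"                        // new default for non-struct keys
}
```

Under this model the legacy flag only affects non-struct keys, which is why it can default to off and be removed later without touching the struct case.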

SparkQA · 2018-11-19T05:23:45Z

Test build #98988 has finished for PR 23054 at commit b5cfda4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2018-11-19T10:39:30Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

      .createWithDefault(false)
+
+  val LEGACY_ALIAS_NON_STRUCT_GROUPING_KEY =
+    buildConf("spark.sql.legacy.dataset.aliasNonStructGroupingKey")


Maybe aliasNonStructGroupingKeyAsValue, and default to false.

Then we can remove this in the future.

Ok. That makes sense. Thanks.
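For readers who need the old name, here is a hedged sketch of how such a legacy flag would be switched on at runtime. It assumes a local Spark session with `spark-sql` on the classpath. The config key in `LegacyFlag` is an assumption: it is the name believed to have been merged after the later "Update config name" commit, while this thread only shows earlier candidates, so verify it against `SQLConf` in your Spark version.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: the object name and sample data are not part of the patch.
object LegacyGroupingKeyFlagSketch {
  // Assumption: believed to be the merged config name; not shown in this thread.
  val LegacyFlag = "spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("legacy-grouping-key-flag-sketch")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq("a", "b", "a").toDS()

    // Flag left at its default: the new naming applies.
    println(s"default: ${ds.groupByKey(s => s).count().schema.head.name}")

    // Flag turned on: the pre-change naming is requested.
    spark.conf.set(LegacyFlag, "true")
    println(s"legacy:  ${ds.groupByKey(s => s).count().schema.head.name}")

    spark.stop()
  }
}
```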

SparkQA · 2018-11-19T21:55:38Z

Test build #99009 has finished for PR 23054 at commit 3930a35.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-11-19T22:07:26Z

Test build #99010 has finished for PR 23054 at commit f58e93e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-11-20T01:38:17Z

sorry it conflicts, can you resolve it? I think it's ready to go

viirya · 2018-11-20T02:26:43Z

@cloud-fan Yea, it's resolved. Thanks.

SparkQA · 2018-11-20T05:59:35Z

Test build #99035 has finished for PR 23054 at commit 0ffdb4b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-11-21T05:42:48Z

Test build #99096 has finished for PR 23054 at commit d784142.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-11-21T06:26:40Z

retest this please

SparkQA · 2018-11-21T08:05:01Z

Test build #99097 has finished for PR 23054 at commit d784142.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-11-21T08:13:31Z

retest this please...

SparkQA · 2018-11-21T11:57:26Z

Test build #99103 has finished for PR 23054 at commit d784142.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-11-21T17:28:59Z

hmmm it conflicts again...

viirya · 2018-11-21T22:57:43Z

yea, resolved again. :)

SparkQA · 2018-11-22T02:47:31Z

Test build #99146 has finished for PR 23054 at commit 70cab8b.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class RuleSummary(
class QueryPlanningTracker
class QueryExecution(

cloud-fan · 2018-11-22T02:51:25Z

thanks, merging to master!

…gation should be named as "key" too

## What changes were proposed in this pull request?

When doing typed aggregation on a Dataset, for struct key type, the key attribute is named as "key". But for non-struct type, the key attribute is named as "value". This key attribute should also be named as "key" for non-struct type.

## How was this patch tested?

Added test.

Closes apache#23054 from viirya/SPARK-26085.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

Named key attribute for primitive type as "key".

c7bbe91

Update migration guide.

42e32ad

cloud-fan reviewed Nov 17, 2018

docs/sql-migration-guide-upgrade.md (Outdated)

Update migration guide.

2b697dc

srowen approved these changes Nov 17, 2018

Add legacy flag.

6e3c37a

viirya force-pushed the SPARK-26085 branch from 4f85876 to 6e3c37a Compare November 18, 2018 09:59

cloud-fan reviewed Nov 19, 2018

Address comments.

b5cfda4

viirya changed the title ~~[SPARK-26085][SQL] Key attribute of primitive type under typed aggregation should be named as "key" too~~ [SPARK-26085][SQL] Key attribute of non-struct type under typed aggregation should be named as "key" too Nov 19, 2018

cloud-fan reviewed Nov 19, 2018

rxin reviewed Nov 19, 2018

Rename config name.

3930a35

viirya force-pushed the SPARK-26085 branch from f58e93e to 3930a35 Compare November 19, 2018 11:04

Merge remote-tracking branch 'upstream/master' into SPARK-26085

0ffdb4b

cloud-fan reviewed Nov 21, 2018

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

Update config name.

d784142

Merge remote-tracking branch 'upstream/master' into SPARK-26085

70cab8b

asfgit closed this in ab2eafb Nov 22, 2018

viirya deleted the SPARK-26085 branch December 27, 2023 18:22

